handle unresponsive nodes #474
Conversation
Force-pushed from 50f1429 to 2909126
Force-pushed from 2909126 to be070fb
Ooh very cool to see progress on this issue. My main comment is structural -- I think we should try to remove the global reasoning about nodes from the per-node reconciliation. The least invasive possibility I can think of is to pull the egress gateway labeling into its own control loop, perhaps worth a shot!
eip_operator/src/controller/node.rs
Outdated
```rust
// The EIP will be disassociated and then this apply() will exit early since no EIP is associated
warn!("Node {} is in Unknown state, disassociating EIP", &name);
crate::aws::disassociate_eip(&self.ec2_client, &eip_name).await?;
crate::eip::set_status_detached(&eip_api, &eip_name).await?;
```
I believe this is race-y... many `apply` fns for different nodes can run concurrently by default, and this `set_status` operation doesn't include Kube's `resourceVersion` to perform optimistic locking. As a result, you could have:
- Node A has the EIP and is marked Unknown
- Node B spins up and the reconciler runs `set_status_should_attach` to attach it to B
- The reconciler for Node A runs `set_status_detached` and removes it from B
- Both reconcilers finish and the EIP is not marked as attached to either node 😱
A Kube `replace_status` would add in some optimistic locking so the two reconcilers cannot overwrite each other, though I haven't thought much about how hard the error cases would be to reason about.
This is now in place per "use replace_status on set_status_detached".
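For context, a minimal sketch of what an optimistic-locking `set_status_detached` could look like with kube-rs. The `Eip` CRD definition here is a stand-in (the group, fields, and status shape are assumptions for illustration, not the operator's actual types):

```rust
use kube::{api::{Api, PostParams}, CustomResource};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

// Stand-in for the operator's Eip CRD; group/version and fields are illustrative.
#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(group = "example.dev", version = "v1", kind = "Eip", status = "EipStatus")]
pub struct EipSpec {}

#[derive(Clone, Debug, Default, Deserialize, Serialize, JsonSchema)]
pub struct EipStatus {
    pub resource_id: Option<String>,
}

async fn set_status_detached(eip_api: &Api<Eip>, name: &str) -> anyhow::Result<()> {
    // Fetch the live object so the write carries its current resourceVersion.
    let mut eip = eip_api.get(name).await?;
    eip.status = Some(EipStatus::default());

    // replace_status is an optimistic-concurrency write: if another reconciler
    // modified this Eip after our get(), the resourceVersion no longer matches
    // and the API server rejects the write with 409 Conflict instead of
    // silently overwriting, so the losing reconciler retries on fresh state.
    eip_api
        .replace_status(name, &PostParams::default(), serde_json::to_vec(&eip)?)
        .await?;
    Ok(())
}
```

A conflict surfaces as a `kube::Error::Api` with code 409, which the controller's normal requeue/backoff machinery can treat as a retry.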
eip_operator/src/controller/node.rs
Outdated
```rust
        return Ok(None);
    }
    _ => return Ok(None),
```
Do we not need to do anything for the other `(true, true, "Unknown") | (true, false, "Unknown") | (false, true, "Unknown")` cases?
Worked with Justin on the cases and refactoring of the node controller. I think this is simplified / covered now.
eip_operator/src/controller/node.rs
Outdated
```rust
// This code path runs a limited number of times once a node goes unresponsive.
// The EIP will be disassociated and then this apply() will exit early since no EIP is associated
warn!("Node {} is in Unknown state, disassociating EIP", &name);
crate::aws::disassociate_eip(&self.ec2_client, &eip_name).await?;
```
If for any reason the `eip-operator` crashed between executing `disassociate_eip` and `set_status_detached`, would we be able to reconcile and clear the status correctly?
@pH14 Good thought! We should probably remove the `resource-id` label first! This would allow the EIP to be reclaimed by another node (or potentially the same node).
We might also want to swap the order in the node controller.
In the node controller, if the node is ready, it should check whether it already has an EIP claimed and ensure it's attached to that EIP. If it's not attached to an EIP, it should attempt to find one and claim it, then attempt to associate it and update the other status fields. (This can probably be done in a second PR, but doing it this way may remove the case where we can't fix the resource ID of the EIP if the label got removed while the EIP is still associated.)
I like this idea! Leaving for a separate PR as discussed.
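To make the proposed ordering concrete, here is a rough sketch of a ready-node reconcile under that scheme. Every helper named here (`find_claimed_eip`, `find_unclaimed_eip`, `claim_eip`, `associate_and_update_status`) is hypothetical, and the `Eip` type is the stand-in from the earlier sketch:

```rust
use k8s_openapi::api::core::v1::Node;
use kube::Api;

// Hypothetical reconcile order for a ready node, per the discussion above:
// record the claim durably in Kubernetes first, only then touch AWS, so a
// crash between the two steps leaves a claim the next reconcile can finish.
async fn reconcile_ready_node(node: &Node, eip_api: &Api<Eip>) -> anyhow::Result<()> {
    // 1. Prefer an EIP this node has already claimed.
    let eip = match find_claimed_eip(eip_api, node).await? {
        Some(eip) => eip,
        None => {
            // 2. Otherwise find an unclaimed EIP and record the claim first.
            match find_unclaimed_eip(eip_api, node).await? {
                Some(eip) => claim_eip(eip_api, &eip, node).await?,
                None => return Ok(()), // nothing available to attach
            }
        }
    };
    // 3. Associate at AWS and fill in the remaining status fields.
    associate_and_update_status(&eip, node).await?;
    Ok(())
}
```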
Feels like this is getting close! Left a few more comments, but the structure here feels much easier to grok.
eip_operator/src/main.rs
Outdated
```rust
kube_runtime::watcher::Config::default().labels(EGRESS_GATEWAY_NODE_SELECTOR_LABEL_KEY);

task::spawn(async move {
    let mut watcher = pin!(kube_runtime::watcher(node_api.clone(), watch_config));
```
Hm, I'd think we'd want to watch `Eip` resources rather than `Node`, since we set the Eip's status each time the Eip is associated or disassociated. It's not clear to me that this fires if we associate an Eip right now.
Separately, this watcher is also self-triggering, where setting the egress gateway status will trigger another reconciliation. Not a correctness issue if it converges, but it could create some read/write amplification or hard-to-debug issues if it does anything unexpected.
Good find! The watcher is now correctly configured on Eip resources.
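A sketch of what the corrected wiring might look like, reusing the `Eip` stand-in from the earlier sketch. `label_egress_nodes` is an assumed helper, and the `watcher::Event` variant names match older (pre-0.86) kube_runtime releases:

```rust
use std::pin::pin;

use futures::TryStreamExt;
use k8s_openapi::api::core::v1::Node;
use kube::Api;
use kube_runtime::watcher;
use tokio::task;

// Watch Eip resources (not Nodes), so the labeler re-runs whenever an Eip's
// association status changes.
fn spawn_egress_labeler(eip_api: Api<Eip>, node_api: Api<Node>) {
    task::spawn(async move {
        let mut events = pin!(watcher(eip_api, watcher::Config::default()));
        // Exits on a watch error; production code would log and restart.
        while let Ok(Some(event)) = events.try_next().await {
            // React to added/updated Eips. Node label writes don't produce
            // Eip events, so this loop no longer triggers itself.
            if let watcher::Event::Applied(_eip) = event {
                if let Err(e) = label_egress_nodes(&node_api).await {
                    tracing::error!("egress labeling failed: {e}");
                }
            }
        }
    });
}
```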
* cleanup egress_ready label
* only detach don't disassociate EIP on unresponsive nodes
LGTM! This really feels like you got to the essence of the problem here, the control loops feel very well-scoped, and the `label_egress_nodes` is super clean.
README.md
Outdated
```diff
@@ -1,3 +1,4 @@
+⚠️⚠️ **WARNING!** ⚠️⚠️ The Materialize K8s-eip-operator will soon be archived and no longer be under active development.
```
Probably enough to say that it "will soon be archived" or "will soon be archived and is no longer under active development."
Cleaned up the wording in node improvements. I left in the warning portion since I found that practice used in another MZ public archive.
```diff
-eip.meta_mut().resource_version = eip_v1.metadata.resource_version.clone();
+let resource_version = eip_v1.metadata.resource_version.clone();
+eip.meta_mut().resource_version = resource_version;
```
Why is this changing here?
After updating Rust I ran into a clippy error. This change fixes: `assigning the result of Clone::clone() may be inefficient`.
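For reference, that message comes from the `clippy::assigning_clones` lint, whose stock suggestion is `clone_from`, which can reuse the destination's existing allocation. A tiny standalone illustration (not the operator's code):

```rust
fn main() {
    let src = String::from("resource-version-409518");
    let mut dst = String::with_capacity(32);

    // Clippy's `assigning_clones` lint flags this pattern:
    //     dst = src.clone();
    // because it always allocates a fresh String.

    // clone_from may reuse dst's existing buffer instead.
    dst.clone_from(&src);
    println!("{dst}");
}
```

The binding-based fix in the diff above sidesteps the lint differently (the assignment's right-hand side is a moved binding rather than a `clone()` call); either form satisfies clippy.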
eip_operator/src/egress.rs
Outdated
```rust
// However, unresponsive nodes may still exist with a ready-status egress label
// set to "true". This allows the old node to still serve traffic if possible until a
// new node is ready to take over.
```
I don't think this agrees with
```rust
// We immediately disassociate EIPs from egress nodes that are in an
// unresponsive state in the eip node controller.
```
If the EIP has been disassociated from a node, it should not be serving traffic; we shouldn't wait to remove the status.
I'm not saying we want to do this, but if this is the behavior we're looking for, I think we'd have to do something like adding a label to the EIP that says it can be reclaimed (or removing the eip resource id), but not actually detach it when a node becomes unresponsive.
Discussed over slack and covered by the egress labeling and node controller refactors. We agreed to set the status of the EIP to detached to free the claim. Egress will continue to attempt to go out the unresponsive node on the EIP as it will still be attached to the instance.
eip_operator/src/egress.rs
Outdated
```rust
    eip_api: &Api<Eip>,
    node_api: &Api<Node>,
) -> Result<(), Error> {
    let egress_nodes = get_egress_nodes(&Api::all(node_api.clone().into())).await?;
```
We probably don't want to get all egress nodes, just the nodes that would match the EIP being reconciled. I think the following should work:
```rust
let egress_nodes = node_api.list(&eip.get_resource_list_params()).await?.items;
```
This will break if the eip is a pod-filtering eip; we could return early if we see that.
I didn't realize this function existed - much cleaner! Covered in egress improvements.
eip_operator/src/egress.rs
Outdated
```rust
    .into_iter()
    .filter_map(|eip| eip.status.and_then(|s| s.resource_id))
    .collect();
```
If we want to return early when there isn't a ready attachment, we can return early here when there isn't an attachment at all. Then I think we can just check whether the attachment for the current eip is ready: if so, we ensure it's marked as egress_ready and everything else is marked as egress-unready; if it's not ready, we do nothing.
This is covered by the refactoring of the egress labeler.
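A condensed sketch of that early-return flow, reusing the `Eip` stand-in from the earlier sketch; `node_is_ready` and `add_gateway_status_label` are assumed helpers:

```rust
use k8s_openapi::api::core::v1::Node;
use kube::{Api, ResourceExt};

// Early-return shape for the labeler: no attachment or an unready attachment
// means do nothing; a ready attachment flips exactly one node to egress-ready.
async fn label_egress_nodes(eip: &Eip, node_api: &Api<Node>) -> anyhow::Result<()> {
    // No attachment on this Eip -> nothing to do.
    let Some(resource_id) = eip.status.as_ref().and_then(|s| s.resource_id.clone()) else {
        return Ok(());
    };

    let nodes = node_api.list(&Default::default()).await?.items;

    // Attachment exists but the attached node is missing or unready -> do
    // nothing, leaving the old labels (and any still-working path) in place.
    let Some(attached) = nodes.iter().find(|n| n.name_unchecked() == resource_id) else {
        return Ok(());
    };
    if !node_is_ready(attached) {
        return Ok(());
    }

    // Mark the attached node ready and every other node unready.
    for node in &nodes {
        let ready = node.name_unchecked() == resource_id;
        add_gateway_status_label(
            node_api,
            &node.name_unchecked(),
            if ready { "true" } else { "false" },
        )
        .await?;
    }
    Ok(())
}
```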
eip_operator/src/egress.rs
Outdated
```rust
// Note(Evan): find nodes that match eips we're reconciling
// if eip has a resource id, see if the node with the resoruce is ready
// if no, do nothing
// if yes, mark that node as egress_ready=true, and mark all other nodes as egress_ready=false
```
```diff
-// Note(Evan): find nodes that match eips we're reconciling
-// if eip has a resource id, see if the node with the resoruce is ready
-// if no, do nothing
-// if yes, mark that node as egress_ready=true, and mark all other nodes as egress_ready=false
+// If the EIP being reconciled has a resource id, find its node, and check if it's ready.
+// If not ready, return early.
+// If ready, mark that node as egress_ready=true, and mark all other nodes as egress_ready=false.
```
Makes sense. Covered in egress improvements
eip_operator/src/egress.rs
Outdated
```rust
for other_node in egress_nodes
    .iter()
    .filter(|n| n.name_unchecked() != node.name_unchecked())
{
    add_gateway_status_label(node_api, other_node.name_unchecked().as_str(), "false")
        .await?;
}
```
You already have a for-each here, you may as well just:
```rust
for other_node in egress_nodes {
    // skip the node we just set to ready
    if other_node.name_unchecked() != node.name_unchecked() {
        add_gateway_status_label(node_api, other_node.name_unchecked().as_str(), "false")
            .await?;
    }
}
```
I see now I already have the current node name. Adjusted as part of egress improvements
eip_operator/src/egress.rs
Outdated
```rust
/// Retrieve all egress nodes in the cluster.
async fn get_egress_nodes(api: &Api<Node>) -> Result<Vec<Node>, kube::Error> {
    let params = ListParams::default().labels(
        format!(
            "{}={}",
            EGRESS_GATEWAY_NODE_SELECTOR_LABEL_KEY, EGRESS_GATEWAY_NODE_SELECTOR_LABEL_VALUE
        )
        .as_str(),
    );

    match api.list(&params).await {
        Ok(node_list) => Ok(node_list.items),
        Err(e) => Err(e),
    }
}
```
We probably don't need this if we're using the node selector on the eip.
Much cleaner, and it removed the need for some const labels in egress improvements.
eip_operator/src/controller/node.rs
Outdated
```rust
// An Unknown ready status could mean the node is unresponsive or experienced a hardware failure.
// A NotReady ready status could mean the node is experiencing a network issue.
_ => {
    // Skip detachment if no EIP is associated with this node.
```
This comment is confusing. Suggested wording: "If an EIP has a resource id label pointing to this node, remove that label, releasing this node's claim to the EIP."
I agree. I changed the wording as part of node improvements
eip_operator/src/controller/node.rs
Outdated
```rust
if let Some(eip) = node_eip {
    warn!(
        "Node {} is in an unresponsive state, detaching EIP {}",
        &name.clone(),
        &eip.name_unchecked()
    );
    crate::eip::set_status_detached(&eip_api, eip).await?;

    return Ok(None);
}
```
If we moved this return out of the `if` and into the match block (with an appropriate comment/debug message) we could avoid having to later check if the node is ready on line 134.
Much cleaner. I was glad to be able to remove the ready_status check from further on in the code. I simplified it in node improvements
Approving, but I think you should change `dbg!` to `debug!` before merging.
eip_operator/src/controller/node.rs
Outdated
```rust
}

dbg!("Node {} is not ready, skipping EIP claim", &name);
```
Should this be `debug!`?
You're right! Just pushed up a fix.
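Worth noting why the swap matters: `dbg!` is std's debugging macro, which doesn't interpolate format strings and always prints to stderr, while `debug!` (assuming the operator logs via `tracing`, as the surrounding `warn!` calls suggest) emits an interpolated, filterable log event. A small standalone illustration:

```rust
use tracing::debug;

fn main() {
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::DEBUG)
        .init();

    let name = "node-a";

    // dbg! has no format-string behavior: this would print the literal
    // "Node {} is not ready..." string and &name as separate debug lines,
    // unconditionally, to stderr.
    // dbg!("Node {} is not ready, skipping EIP claim", &name);

    // debug! interpolates and respects the subscriber's level filter.
    debug!("Node {} is not ready, skipping EIP claim", name);
}
```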
Motivation
Updates the eip-operator to handle unresponsive egress nodes. A new label is set on egress nodes indicating their status. New logic is added to disassociate EIPs from unresponsive nodes. A separate function handles cleaning up unresponsive egress nodes to update their status label.
Important: The README.md has been updated to reflect this repo being archived in the near future.
Notes for reviewer
Clippy warnings did not like the nested conditional blocks on match statements. I did some reasonable refactoring to address the warnings; I'm not sure it actually helps readability. What was more important to me was minimizing unnecessary API calls where possible.
Testing
I've tested egress nodes going unresponsive and egress nodes being rolled as well. The code changes match the existing behavior except now egress moves to the new node when the old one goes unresponsive.